26 research outputs found

    A manual for PARTI runtime primitives

    Get PDF
    Primitives are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equations solvers) on distributed memory machines. These primitives are also designed for use in compilers for distributed memory multiprocessors. Communications patterns are captured at runtime, and the appropriate send and receive messages are automatically generated

    Execution time supports for adaptive scientific algorithms on distributed memory machines

    Get PDF
    Optimizations are considered that are required for efficient execution of code segments that consists of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives an appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communications patterns are derived at runtime, and the appropriate send and receive messages are automatically generated

    A manual for PARTI runtime primitives, revision 1

    Get PDF
    Primitives are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equations solvers) on distributed memory machines. These primitives are also designed for use in compilers for distributed memory multiprocessors. Communications patterns are captured at runtime, and the appropriate send and receive messages are automatically generated

    Run-time scheduling and execution of loops on message passing machines

    Get PDF
    Sparse system solvers and general purpose codes for solving partial differential equations are examples of the many types of problems whose irregularity can result in poor performance on distributed memory machines. Often, the data structures used in these problems are very flexible. Crucial details concerning loop dependences are encoded in these structures rather than being explicitly represented in the program. Good methods for parallelizing and partitioning these types of problems require assignment of computations in rather arbitrary ways. Naive implementations of programs on distributed memory machines requiring general loop partitions can be extremely inefficient. Instead, the scheduling mechanism needs to capture the data reference patterns of the loops in order to partition the problem. First, the indices assigned to each processor must be locally numbered. Next, it is necessary to precompute what information is needed by each processor at various points in the computation. The precomputed information is then used to generate an execution template designed to carry out the computation, communication, and partitioning of data, in an optimized manner. The design is presented for a general preprocessor and schedule executer, the structures of which do not vary, even though the details of the computation and of the type of information are problem dependent

    Distributed memory compiler design for sparse problems

    Get PDF
    A compiler and runtime support mechanism is described and demonstrated. The methods presented are capable of solving a wide range of sparse and unstructured problems in scientific computing. The compiler takes as input a FORTRAN 77 program enhanced with specifications for distributing data, and the compiler outputs a message passing program that runs on a distributed memory computer. The runtime support for this compiler is a library of primitives designed to efficiently support irregular patterns of distributed array accesses and irregular distributed array partitions. A variety of Intel iPSC/860 performance results obtained through the use of this compiler are presented

    Krylov methods preconditioned with incompletely factored matrices on the CM-2

    Get PDF
    The performance is measured of the components of the key interative kernel of a preconditioned Krylov space interative linear system solver. In some sense, these numbers can be regarded as best case timings for these kernels. Sweeps were timed over meshes, sparse triangular solves, and inner products on a large 3-D model problem over a cube shaped domain discretized with a seven point template. The performance of the CM-2 is highly dependent on the use of very specialized programs. These programs mapped a regular problem domain onto the processor topology in a careful manner and used the optimized local NEWS communications network. The rather dramatic deterioration in performance was documented when these ideal conditions no longer apply. A synthetic workload generator was developed to produce and solve a parameterized family of increasingly irregular problems

    A scheme for supporting distributed data structures on multicomputers

    Get PDF
    A data migration mechanism is proposed that allows an explicit and controlled mapping of data to memory. While read or write copies of each data element can be assigned to any processor's memory, longer term storage of each data element is assigned to a specific location in the memory of a particular processor. The proposed integration of a data migration scheme with a compiler is able to eliminate the migration of unneeded data that can occur in multiprocessor paging or caching. The overhead of adjudicating multiple concurrent writes to the same page or cache line is also eliminated. Data is presented that suggests that the scheme may be a pratical method for efficiently supporting data migration

    A scheme for supporting automatic data migration on multicomputers

    Get PDF
    A data migration mechanism is proposed that allows an explicit and controlled mapping of data to memory. While read or write copies of each data element can be assigned to any processor's memory, longer term storage of each data element is assigned to a specific location in the memory of a particular processor. Data is presented that suggests that the scheme may be a practical method for efficiently supporting data migration

    MULTIPROCESSORS AND RUNTIME COMPILATION 1

    Get PDF
    ('_I,,c;,__._,,._] f'.?t, no) i"dlLT [_r,!]L.r-s_{j n: _ A _ " 24 p C_:C
    corecore